Gender, Degrees and Wages:
The Bayesian Workflow Applied to US Census Data

Thilde Terkelsen & Timo Röder

May 14, 2018

Introduction

In this project we apply the Bayesian workflow to US census data from the years 2012 to 2016 in order to model wages and gender wage differences for college graduates from three academic fields living in New York state and Texas.

The gender wage gap, i.e. the average difference in pay between men and women, continues to be a subject of great debate in academia, media and politics as well as in more informal settings. Extensive studies on the wage gap’s underlying mechanisms and its historical trends have arrived at somewhat varying conclusions. This is not too surprising: setting up appropriate study designs and statistical models to study people’s wages is a fairly complex endeavour, as countless factors influence the wage any individual receives. Variables such as educational level, field of study, maternity leave, work-life balance or even the income of an individual’s parents, to name a few, confound the question of whether the observed differences in salaries are in fact due to gender bias or simply an artefact of other social, economic or familial circumstances.

An extensive report titled “Graduating to a Pay Gap” by Corbett and Hill (2012) (Corbett, Christianne, and Catherine Hill. “Graduating to a Pay Gap: The Earnings of Women and Men One Year after College Graduation.” American Association of University Women, 2012) explores the gender difference in US wages one year after graduation. Corbett and Hill compared the salaries of recent graduates who reported working the same number of hours per week, and found that one year after graduation women earned ~18% less than their male counterparts. As there are differences regarding the college majors and jobs that men and women tend to choose – for instance, men are more likely to choose fields with relatively higher salaries such as engineering and computer science – the study grouped college majors into more general categories. Analyses were then performed separately for each category and compared afterwards. While the study only found minor wage differences for majors in humanities, education and health sciences, there were significant gender wage gaps within the fields of engineering, computer science, business and social sciences.

While the gender wage gap has shrunk in the US and other developed countries over the last 30 years, this wage convergence seems to be stagnating at the moment (Heathcote, Jonathan, Kjetil Storesletten, and Giovanni L. Violante. “The macroeconomic implications of rising wage inequality in the United States.” Journal of Political Economy 118.4 (2010): 681-722). Along similar lines, a recent study of historical labour data has questioned the notion that some of the gender wage gap can still be attributed to gender differences in education (Blau, Francine D., and Lawrence M. Kahn. “The gender wage gap: Extent, trends, and explanations.” Journal of Economic Literature 55.3 (2017): 789-865). In the study, Blau and Kahn show that while in 1980 approximately 21.1% of the wage gap could be explained by work experience and 2.6% by educational level, by 2010 those numbers had decreased to 14.1% and -5.9% respectively, indicating that education is no longer a contributing factor – if anything, women nowadays earn too little relative to their educational level.

Coming back to this project, it should be said that our scope is much smaller than that of these studies: we have not attempted to disentangle the possible mechanisms underlying the wage gap.

Instead, our approach is rather descriptive: we set out to investigate wage as a function of people’s age and how this association may affect the gender wage gap – does the gap grow or shrink when looking at different age cohorts? Is such an effect equally strong across different fields of work and different states, or are wages consistently more equal in some of these groups? These are questions we would like to approach by applying the Bayesian workflow presented in the course.

While we cannot claim that the presented work provides ground-breaking answers to any unsolved research questions surrounding the gender wage gap, we hope that our report is still an informative and enjoyable read.

Data

From raw data to start of analysis

The data sources as well as some preliminary data cleaning steps are inspired by a FiveThirtyEight article from 2014: The Economic Guide To Picking A College Major.

Our primary data source is the American Community Survey (ACS) Public Use Microdata Sample (PUMS) conducted by the United States Census Bureau. The ACS surveys around 3.5 million households each year.

We are using the multiyear PUMS file covering census data for the years 2012 to 2016, which contains records on ~16 million individual residents. It can be downloaded as a zip file by selecting “United States Population Records” from this website. The zip file contains four large CSV files which together cover the complete dataset. Additional information on the PUMS files is provided on the US Census Bureau’s website: there is a general overview (link to PDF) as well as a data dictionary (link to PDF).

In the first data cleaning step, we remove individuals who did not graduate college (see R script here) and write the remaining 3,489,070 entries to a file called us16_filtered.csv.

Next up, we add information about college majors by merging our dataset with a lookup file called majors-list.csv prepared by FiveThirtyEight (GitHub link). This lookup file matches the FOD1P codes used in the PUMS to the actual names of the corresponding college majors. Additionally, these college majors are grouped into the categories used in a major 2011 study (Carnevale et al., “What’s It Worth?: The Economic Value of College Majors.” Georgetown University Center on Education and the Workforce, 2011).

You can find the R script we used to merge datasets on GitHub. This script also converts all wages reported before 2016 to inflation-adjusted wages equivalent to 2016 dollars, before writing selected columns for the 3,489,070 entries to a file called selected_vars.csv.
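The inflation adjustment itself is a simple rescaling: each pre-2016 wage is multiplied by the ratio of the 2016 price level to the survey year's price level. A minimal sketch with hypothetical adjustment factors (the factors in the actual merge script are derived from official price indices, not the made-up values below):

```r
# Hypothetical year -> 2016-dollar adjustment factors (illustrative values only;
# the real factors come from official price indices).
adj_factor <- c("2012" = 1.06, "2013" = 1.04, "2014" = 1.03,
                "2015" = 1.01, "2016" = 1.00)

# Rescale a nominal wage from its survey year into 2016 dollars.
to_2016_dollars <- function(wage, year) {
  unname(wage * adj_factor[as.character(year)])
}

to_2016_dollars(50000, 2012)  # 50,000 nominal 2012 dollars, in 2016 dollars
```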

Intermission: what is it we actually want to do?

Before moving on with the data preparation steps, it is probably a good idea to set the stage regarding what we want to model and why. Hopefully, this will make some of the following sections easier to follow.

As touched upon in the introduction, our intention here is to run relatively straightforward hierarchical regression models in order to model wages across different ages. Doing this in a Bayesian setting of course allows us to draw posterior predictive samples from the fitted models and we can use these posterior draws to answer some questions about the data.

For instance, you will see that the wage gap in absolute dollar values is substantially smaller when looking at education graduates as opposed to business graduates. However, wages are generally lower in the education group, so the observation about the wage gap might just be a side effect of this general difference in wages. Drawing posterior predictive samples allows us to check if the difference in the wage gap holds up when looking at relative values rather than absolute ones.

Selecting the data subset

The file selected_vars.csv is the starting point for our actual analysis:

library(readr)
selected_vars <- read_csv("data/selected_vars.csv")
selected_vars
## # A tibble: 3,489,070 x 9
##     SERIALNO    ST  AGEP   SEX FOD1P Major  Major_Category real_total_earn
##        <dbl> <int> <int> <int> <int> <chr>  <chr>                    <dbl>
##  1   2.01e12    13    69     1  4101 PHYSI… Industrial Ar…           36961
##  2   2.01e12    13    53     2  2304 ELEME… Education                41185
##  3   2.01e12    13    34     2  1100 GENER… Agriculture &…            4435
##  4   2.01e12    13    56     1  6203 BUSIN… Business                 13728
##  5   2.01e12     6    49     1  2414 MECHA… Engineering              72866
##  6   2.01e12     6    48     2  3600 BIOLO… Biology & Lif…          211206
##  7   2.01e12     6    50     2  2300 GENER… Education                84482
##  8   2.01e12     5    67     2  6004 COMME… Arts                     10455
##  9   2.01e12     6    59     1  5003 CHEMI… Physical Scie…               0
## 10   2.01e12     6    54     2  2901 FAMIL… Industrial Ar…               0
## # ... with 3,489,060 more rows, and 1 more variable: real_wage <dbl>

It contains nine variables:

- SERIALNO: serial number of the sampled housing unit
- ST: numeric state code (36 = New York, 48 = Texas)
- AGEP: age in years
- SEX: numeric sex code (1 = male, 2 = female)
- FOD1P: numeric code for the field of the college degree
- Major: name of the college major (matched via the FiveThirtyEight lookup file)
- Major_Category: the broader category the major belongs to
- real_total_earn: total yearly earnings, inflation-adjusted to 2016 dollars
- real_wage: yearly wage, inflation-adjusted to 2016 dollars

While it would of course be most appropriate to conduct an analysis on the complete dataset, we chose to focus only on a select subset of entries in order to keep both computation times and model interpretation manageable. Our chosen subset consists of entries from two states and three major categories.

We choose New York and Texas as two similarly large states which should serve as decent stand-ins for politically opposed blue states (NY) and red states (TX). The major categories we focus on are business, education and health. We select the data subset with the following R code:

library(dplyr)
nytx_df <- selected_vars %>% 
  filter(ST %in% c(36, 48),
         Major_Category %in% c("Education", "Health", "Business"),
         real_wage > 0) %>% 
  mutate(state = if_else(ST == 36, "NY", "TX"),
         sex = if_else(SEX == 1, "male", "female"))

Note that in the above code chunk, we also remove respondents who reported no wage and recode ST and SEX into the more explicit variables state and sex.

The resulting subset contains 148,490 entries, which can be broken down as follows:

library(tidyr)
nytx_df %>% 
  group_by(Major_Category, state, sex) %>% 
  summarise(entries = n()) %>% 
  spread(key = "sex", value = "entries") %>% 
  mutate(ratio = female / male,
         total = male + female) %>% 
  knitr::kable(., digits = 2, format.args = list(big.mark = ","),
               col.names = c("major category", "state", "female", "male",
                             "ratio (f/m)", "total"))
major category state female male ratio (f/m) total
Business NY 15,789 19,265 0.82 35,054
Business TX 18,888 24,829 0.76 43,717
Education NY 15,422 4,996 3.09 20,418
Education TX 17,681 5,024 3.52 22,705
Health NY 11,075 2,507 4.42 13,582
Health TX 10,417 2,597 4.01 13,014

We can see that both total number and gender ratio of entries for the two states are roughly comparable across the three major categories.

Data summaries

In order to get an overview of the general trend in wages as a function of age, we plot the mean and variation of yearly wages across the age range from 20 to 70:

library(ggplot2)
theme_set(custom_ggtheme)

nytx_df %>% 
  filter(AGEP >= 20,
         AGEP <= 70) %>% 
  group_by(AGEP, sex, Major_Category, state) %>% 
  summarise(mean_wage = mean(real_wage, na.rm = T),
            sd_wage = sd(real_wage, na.rm = T)) %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = mean_wage, colour = sex)) +
  geom_line(aes(y = mean_wage + 2 * sd_wage, colour = sex), linetype = "dashed") +
  ggtitle("Data summary: yearly wages (mean + 2 standard deviations)") +
  xlab("age") +
  ylab("yearly wages in USD") +
  facet_grid(state ~ Major_Category)

We can see differences in wages, slopes as well as in the gender wage gap when comparing the different panels. Nothing here is too surprising: business graduates have the highest earning potential, followed by health graduates, with education graduates coming in last. Furthermore, it appears that wages grow steadily through people’s thirties and forties until growth starts to stall around the age of fifty. This observation is fairly consistent across all panels.

Looking at the gender wage gap, it is consistently smallest among education graduates (at least in absolute dollar terms). While the gap within health and business ends up roughly similar in older age cohorts, the trajectories differ: women with a business background seem to earn less than men right out of the gate, whereas wages only start to separate around age 35 for health graduates.

On a more technical note, we can also see that there is a lot of variation in the wages within each age cohort. This is not unexpected; however, it has implications for our model choice: if we were to use a regression model that directly models wage (with Gaussian noise) as a function of age, we would very likely end up with negative wages when drawing posterior predictive samples from that model.

In order to avoid this, we log-transform the wage numbers before running our models. This will ensure that our posterior predictive samples are positive (after exponentiating them) – and has the convenient side-effect of lowering the impact of high outlier wages.

model_df1 <- nytx_df %>% 
  mutate(log_wage = log(real_wage))
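As a quick sanity check of this argument: whatever Gaussian draws we obtain on the log scale, exponentiating maps every one of them to a strictly positive wage. A small simulated sketch (the mean and standard deviation are illustrative):

```r
set.seed(1)
# Simulated posterior predictive draws on the log-wage scale (values illustrative).
log_draws <- rnorm(4000, mean = 10.8, sd = 0.95)

# Back-transforming to the dollar scale guarantees positive wages:
wage_draws <- exp(log_draws)
all(wage_draws > 0)  # TRUE by construction
```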

While using Gaussian distributions to describe the wage distributions at each age point is of course still an approximation, it is not a terribly outrageous one after log-transforming the wages. We can see this in the violin plots of select age cohorts (we only show eight because it would start to get cramped with more age cohorts):

model_df1 %>% 
  filter(AGEP %in% seq(25, 60, 5)) %>% 
  ggplot(., aes(x = factor(AGEP), y = log_wage)) +
  geom_violin() +
  ggtitle("Data summary: log-transformed yearly wages in eight age cohorts") +
  xlab("age") +
  ylab("log-transformed yearly wages") +
  facet_grid(state ~ Major_Category)

Linear model: a simple start

To kick things off, we fit a hierarchical generalised linear model with Gaussian noise. As we have discovered when looking at the data summary earlier, wages start to stagnate around the age of fifty. Therefore, we will only fit the linear model in the age span 25 to 50 – this is where the wage growth happens.

lin_df <- model_df1 %>% 
  filter(AGEP >= 25,
         AGEP <= 50)

The resulting data frame lin_df still contains 85,843 entries. In order to keep computation times manageable, we reduce the size of the dataset by taking a random sample of 5,000 entries:

set.seed(42)
prototype_lin_df <- lin_df %>% 
  sample_n(size = 5000)

Model fitting & convergence diagnostics

We can now go ahead and fit a hierarchical generalised linear model using the stan_glmer() function from rstanarm. Log-transformed wages will be fitted as a linear function of age with Gaussian noise. We want to allow for the slope of this function to be different between the genders, so we add sex as a grouping term. In addition, we specify the hierarchical groupings Major_Category and state.

Sampling takes place with rstanarm’s default settings: four chains of 2,000 iterations each, the first 1,000 of which are warm-up. Default priors are N(0, 2.5) for regression coefficients, N(0, 10) for intercepts and exponential(1) for the sigma parameter of the Gaussian noise. Note that these priors apply after the regression’s predictors have been centred.

library(rstan)
library(rstanarm)

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores() - 1)
lin_fit <- stan_glmer(log_wage ~ (AGEP | sex) + 
                        (1 | Major_Category) + (1 | state),
                      data = prototype_lin_df, family = gaussian(), 
                      adapt_delta = 0.98)

Let’s have a look at the model fit summary, including parameter estimates and convergence diagnostics:

knitr::kable(lin_fit$stan_summary, digits = 2)
mean se_mean sd 2.5% 10% 25% 50% 75% 90% 97.5% n_eff Rhat
(Intercept) 10.03 0.02 0.58 8.76 9.49 9.81 10.04 10.26 10.53 11.06 599.10 1.01
b[(Intercept) Major_Category:Business] 0.09 0.01 0.30 -0.44 -0.13 -0.02 0.08 0.17 0.30 0.67 967.76 1.00
b[(Intercept) Major_Category:Education] -0.17 0.01 0.30 -0.67 -0.40 -0.28 -0.18 -0.08 0.05 0.40 972.32 1.00
b[(Intercept) Major_Category:Health] 0.11 0.01 0.30 -0.39 -0.11 0.00 0.10 0.20 0.33 0.70 951.42 1.00
b[(Intercept) Major_Category:_NEW_Major_Category] 0.00 0.01 0.49 -0.91 -0.38 -0.16 0.00 0.15 0.38 0.90 2876.48 1.00
b[(Intercept) state:NY] 0.06 0.02 0.45 -0.76 -0.23 -0.04 0.05 0.16 0.40 0.92 554.35 1.01
b[(Intercept) state:TX] -0.05 0.02 0.45 -0.87 -0.34 -0.15 -0.05 0.05 0.29 0.82 554.60 1.01
b[(Intercept) state:_NEW_state] 0.00 0.01 0.67 -1.33 -0.45 -0.14 0.00 0.14 0.47 1.30 3276.54 1.00
b[(Intercept) sex:female] 0.07 0.01 0.19 -0.21 -0.06 -0.01 0.04 0.14 0.26 0.51 1184.33 1.00
b[AGEP sex:female] 0.02 0.00 0.00 0.01 0.01 0.01 0.02 0.02 0.02 0.02 3118.43 1.00
b[(Intercept) sex:male] -0.09 0.01 0.19 -0.51 -0.29 -0.15 -0.05 0.00 0.05 0.22 1351.24 1.00
b[AGEP sex:male] 0.03 0.00 0.00 0.03 0.03 0.03 0.03 0.03 0.03 0.04 2581.58 1.00
b[(Intercept) sex:_NEW_sex] 0.01 0.00 0.29 -0.52 -0.21 -0.07 0.00 0.08 0.24 0.59 3312.71 1.00
b[AGEP sex:_NEW_sex] 0.00 0.00 0.11 -0.24 -0.09 -0.03 0.00 0.03 0.08 0.22 2025.84 1.00
sigma 0.93 0.00 0.01 0.91 0.92 0.93 0.93 0.94 0.94 0.95 4000.00 1.00
Sigma[Major_Category:(Intercept),(Intercept)] 0.23 0.02 0.93 0.01 0.01 0.03 0.06 0.16 0.47 1.60 1464.14 1.00
Sigma[state:(Intercept),(Intercept)] 0.44 0.05 1.48 0.00 0.00 0.01 0.05 0.26 1.00 3.44 1017.29 1.00
Sigma[sex:(Intercept),(Intercept)] 0.09 0.01 0.35 0.00 0.00 0.00 0.02 0.07 0.20 0.64 2120.05 1.00
Sigma[sex:AGEP,(Intercept)] 0.00 0.00 0.03 -0.05 -0.01 0.00 0.00 0.00 0.01 0.05 2242.27 1.00
Sigma[sex:AGEP,AGEP] 0.01 0.00 0.03 0.00 0.00 0.00 0.00 0.01 0.03 0.10 1192.68 1.00
mean_PPD 10.87 0.00 0.02 10.83 10.85 10.86 10.87 10.88 10.89 10.90 4000.00 1.00
log-posterior -6775.87 0.12 3.93 -6784.42 -6781.16 -6778.33 -6775.53 -6773.14 -6771.06 -6768.95 1088.16 1.00

All Rhat values are below 1.1, giving an indication that the chains have converged. We can also assess convergence by inspecting the trace plots of the sampling chains (the 1,000 samples after warm-up are shown):

library(bayesplot)
theme_set(custom_ggtheme)
mcmc_trace(as.array(lin_fit), np = nuts_params(lin_fit))

While the chains are mostly well mixed and stationary, there is a noticeable number of divergent transitions. We have followed the recommendation of increasing adapt_delta closer to 1 (the default is 0.95). This has reduced, but not eliminated, the divergent transitions. In summary, the number of divergences is something we take note of, but have to accept for now.
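For reference, the Rhat diagnostic quoted above can be computed by hand. The sketch below implements the classic (non-split) potential scale reduction factor for a single parameter; rstan's reported values additionally split each chain in half, so treat this only as an illustration of the idea:

```r
set.seed(42)
# Four well-mixed "chains" of 1,000 draws each, stored as columns.
chains <- matrix(rnorm(4000), nrow = 1000, ncol = 4)

rhat <- function(draws) {
  n <- nrow(draws)
  W <- mean(apply(draws, 2, var))     # mean within-chain variance
  B <- n * var(colMeans(draws))       # between-chain variance
  var_hat <- (n - 1) / n * W + B / n  # pooled posterior variance estimate
  sqrt(var_hat / W)
}

rhat(chains)  # close to 1 for well-mixed chains
```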

Predictive performance: PSIS-LOO

We can use Pareto smoothed importance sampling (PSIS) leave-one-out (LOO) cross-validation (CV) to assess the predictive performance of the linear model. This is done by applying LOO-CV to the log-likelihood matrix of the fitted model. We can then plot the estimated shape parameters k of the generalised Pareto distribution in order to check the reliability of the expected log pointwise predictive density (elpd) estimate:

loo_lin <- loo(log_lik(lin_fit))
par(bg = '#fffff8')
plot(loo_lin)

All k values are below 0.5, indicating that the importance sampling estimates converge quickly and are reliable (Vehtari, Aki, Andrew Gelman, and Jonah Gabry. “Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC.” Statistics and Computing 27.5 (2017): 1413-1432).

Posterior predictions

With the linear model fitted, we can now draw (4000) posterior predictive samples to answer some of our questions around wages and the gender wage gap.

First, let’s have a look at the predicted wages. We show the median and central 95% interval for wages in each age cohort:

post_sample_grid <- expand.grid(AGEP = 25:50, 
                                sex = c("male", "female"), 
                                Major_Category = c("Education", "Health", "Business"), 
                                state = c("NY", "TX"),
                                KEEP.OUT.ATTRS = F,
                                stringsAsFactors = F)

log_post_lin <- posterior_predict(lin_fit, post_sample_grid, seed = 42)
exp_post_lin <- exp(log_post_lin)

post_df <- bind_cols(post_sample_grid, as.data.frame(t(exp_post_lin)))

post_long <- post_df %>% 
  gather(key = "sample_no", value = "predicted_earning", V1:V4000)

ci_post_pred <- post_long %>% 
  group_by(AGEP, sex, Major_Category, state) %>% 
  summarise(q_lower = quantile(predicted_earning, 0.025, names = F),
            median = quantile(predicted_earning, 0.5, names = F),
            q_upper = quantile(predicted_earning, 0.975, names = F),
            q90 = quantile(predicted_earning, 0.9, names = F)) %>% 
  mutate(model = "linear")

ci_post_pred %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = median, colour = sex)) +
  geom_line(aes(y = q_lower, colour = sex), linetype = "dashed") +
  geom_line(aes(y = q_upper, colour = sex), linetype = "dashed") +
  ggtitle("Posterior prediction (linear model): yearly wages (median & central 95% interval)") +
  xlab("age") +
  ylab("yearly wages in USD") +
  facet_grid(state ~ Major_Category)

At a first glance, we can already see that the model is not entirely unreasonable: it has generally replicated the trends we have seen in the real data, including the existence of a gender wage gap.

In order to show the context of real data more explicitly, we can plot the predictions together with the real data. Note that this time around, we plot the data median and the 90% quantile at each age point (instead of the mean and two standard deviations used earlier). Because high wages are masked (top-coded) in our original dataset, the real data’s 97.5% wage quantiles look rather peculiar – they form a horizontal line across all ages – so the real data may look different in this plot compared to the earlier one.
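A flat upper-quantile line is exactly what such masking produces: once the cap lies below a cohort's upper tail, the empirical 97.5% quantile simply equals the cap for every cohort. A tiny simulation (the cap value and wage distribution below are made up for illustration):

```r
set.seed(3)
cap <- 300000  # hypothetical top-coding threshold
# Log-normal "wages" with a heavy upper tail, then masked at the cap:
wages <- pmin(rlnorm(10000, meanlog = 11, sdlog = 0.9), cap)

quantile(wages, 0.975)  # equals the cap, producing the flat line
```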

data_intervals_median <- model_df1 %>% 
  filter(AGEP >= 20,
         AGEP <= 70) %>% 
  group_by(AGEP, sex, Major_Category, state) %>% 
  summarise(median = median(real_wage, na.rm = T),
            q90 = quantile(real_wage, 0.9, na.rm = T)) %>% 
  ungroup() %>% 
  mutate(model = "data")

library(stringr)
bind_rows(ci_post_pred, data_intervals_median) %>% 
  mutate(plot_cat = paste0(str_sub(sex, 1, 1), " (", model, ")")) %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = median, colour = plot_cat)) +
  geom_line(aes(y = q90, colour = plot_cat), linetype = "dashed") + 
  scale_colour_manual(values = c("#e31a1c", "#fd8d3c", "#08306b", "#2171b5"),
                      name = "group") +
  ggtitle("Posterior prediction (linear model) & real data: yearly wages (median & 90% quantile)") +
  xlab("age") +
  ylab("yearly wages in USD") +
  facet_grid(state ~ Major_Category)

The median values are roughly comparable with those in the real data; however, our posterior sample shows higher variation than the real wage data. This is especially prominent in the education and health cohorts – since our hierarchical specification shares a single Gaussian noise term across all groups, the residual variation cannot differ much between major categories.

Now that we have posterior samples, we can use them to investigate some specific aspects of the wage data by deriving quantities from the posterior sample. One such derivation is to calculate the difference of predicted yearly wages for men and women in each panel.

ci_diff <- post_long %>% 
  spread(key = "sex", value = "predicted_earning") %>% 
  mutate(diff_mf = male - female) %>%
  group_by(AGEP, Major_Category, state) %>%
  summarise(q_lower = quantile(diff_mf, 0.025, names = F),
            median = quantile(diff_mf, 0.5, names = F),
            q_upper = quantile(diff_mf, 0.975, names = F))

ci_diff %>%
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = median)) +
  geom_line(aes(y = q_lower), linetype = "dashed") +
  geom_line(aes(y = q_upper), linetype = "dashed") +
  ggtitle("Posterior prediction (linear model): wage difference (male - female; central 95% interval)") +
  xlab("age") +
  ylab("difference in yearly wages [USD]") +
  facet_grid(state ~ Major_Category)

It seems like there are disparities between the panels when looking at these wage differences. However, as we have alluded to earlier, this might be entirely due to the fact that the wage levels themselves also vary across the panels.

We can see if that is the case by investigating the wage ratio instead of the difference. This gives us a relative measure instead of an absolute one. A ratio like this is best visualised on the log2 scale:

ci_ratio <- post_long %>% 
  spread(key = "sex", value = "predicted_earning") %>% 
  mutate(ratio_mf = male / female) %>% 
  group_by(AGEP, Major_Category, state) %>% 
  summarise(q_lower = quantile(ratio_mf, 0.025, names = F),
            median = quantile(ratio_mf, 0.5, names = F),
            q_upper = quantile(ratio_mf, 0.975, names = F)) 

ci_ratio %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = log2(median))) +
  geom_line(aes(y = log2(q_lower)), linetype = "dashed") +
  geom_line(aes(y = log2(q_upper)), linetype = "dashed") +
  ggtitle("Posterior prediction (linear model): log2 ratio of wages (male / female; central 95% interval)") +
  xlab("age") +
  ylab("log2 ratio of yearly wages") +
  facet_grid(state ~ Major_Category) 
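The log2 scale is chosen for its symmetry: a doubling and a halving of wages land the same distance from zero, so "men earn more" and "women earn more" are visually comparable. In two lines:

```r
log2(2)    # male / female = 2   ->  1 on the log2 scale
log2(0.5)  # male / female = 0.5 -> -1, the mirror image
```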

Unfortunately (in terms of interesting outcomes), it does not seem like we can identify any group differences in the relative gender wage gap with this linear model: the slope and location of the curves are practically identical across all six panels.

As unexciting as it may be, this happenstance actually makes it easier to summarise the predicted wage ratios for workers of different ages by simply taking the mean across all six panels:

ci_ratio %>% 
  filter(AGEP %in% seq(26, 50, 3)) %>% 
  group_by(AGEP) %>% 
  summarise(mean_lower = mean(q_lower),
            mean_50 = mean(median),
            mean_upper = mean(q_upper)) %>% 
  mutate(mean_lower = paste0("female ", round(1 / mean_lower, 2), "x higher"),
         mean_50 = paste0("male ", round(mean_50, 2), "x higher"),
         mean_upper = paste0("male ", round(mean_upper, 2), "x higher")) %>% 
  knitr::kable(., col.names = c("age", "2.5%", "50%", "97.5%"), align = "rccc")
age 2.5% 50% 97.5%
26 female 10.4x higher male 1.26x higher male 16.55x higher
29 female 9.76x higher male 1.33x higher male 17.57x higher
32 female 9.85x higher male 1.39x higher male 18.1x higher
35 female 9.3x higher male 1.45x higher male 19.26x higher
38 female 8.96x higher male 1.51x higher male 19.4x higher
41 female 8.62x higher male 1.55x higher male 21.36x higher
44 female 8.33x higher male 1.63x higher male 21.28x higher
47 female 7.82x higher male 1.76x higher male 22.14x higher
50 female 7.6x higher male 1.79x higher male 22.91x higher

This table shows the trends from the previous figure on a more tangible scale than the logarithmic one: as men’s relative wages grow with age, the credible intervals of the wage ratios shift accordingly.
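The translation between the log2 figures in the previous plot and the "x higher" wording in the table is plain exponentiation; for an illustrative log2 ratio of 0.6:

```r
# A log2 wage ratio of 0.6 corresponds to men earning about 1.52 times more:
2^0.6
# A negative log2 ratio of the same size flips the direction:
2^-0.6  # the reciprocal, read as women earning about 1.52 times more
```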

Additive mixed model: ramped up complexity

The linear model above is a fairly simple first approach to modelling our wage data. With the next model in this report, we increase the complexity substantially (arguably too much so).

The model is a hierarchical generalised additive mixed model (GAMM) built on a cubic B-spline basis, which models the response (log-transformed wages) as a piecewise polynomial function of age defined by a set of knot points (see chapter 20, pages 487-489 of Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science, for more details).
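To make the spline idea concrete, here is a minimal sketch of a cubic B-spline basis over our age range, built with the base splines package (stan_gamm4() constructs its basis internally via mgcv, so this only illustrates the principle; the choice of nine basis functions loosely mirrors the nine s(AGEP) terms in the fit below):

```r
library(splines)

age <- 20:70
# Cubic (degree 3) B-spline basis with nine basis functions;
# the fitted smooth is a weighted sum of these columns.
B <- bs(age, df = 9, degree = 3)
dim(B)  # 51 ages x 9 basis columns
```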

Model fitting & convergence diagnostics

We fit the GAMM using the rstanarm function stan_gamm4(), with the following priors: N(0, 2.5) for regression coefficients, flat uniform for intercepts and exponential(1) for the standard deviations of the smooth terms.

Since this model should be able to pick up trends beyond the linear growth seen at younger ages, we fit it to a random sample of 5,000 entries drawn from the complete age range (rather than restricting to ages 25 to 50):

set.seed(42)
prototype_spline_df <- model_df1 %>%
  sample_n(size = 5000)

log_GAMM <- stan_gamm4(log_wage ~ s(AGEP), 
                       random = ~(sex | Major_Category) + (sex | state),
                       prior_intercept = NULL, 
                       data = prototype_spline_df,
                       family = gaussian())

Again, let’s have a look at parameter estimates and convergence diagnostics:

knitr::kable(log_GAMM$stan_summary, digits = 2)
mean se_mean sd 2.5% 10% 25% 50% 75% 90% 97.5% n_eff Rhat
(Intercept) 10.58 0.01 0.34 9.85 10.23 10.42 10.58 10.75 10.94 11.25 858.84 1.00
s(AGEP).1 4.39 0.03 2.12 0.41 1.73 2.89 4.40 5.82 7.14 8.68 4000.00 1.00
s(AGEP).2 -0.11 0.06 2.20 -4.20 -2.92 -1.65 -0.17 1.40 2.73 4.18 1322.11 1.00
s(AGEP).3 -0.58 0.02 1.48 -3.55 -2.51 -1.58 -0.55 0.44 1.29 2.20 4000.00 1.00
s(AGEP).4 -0.63 0.05 1.63 -3.84 -2.81 -1.80 -0.60 0.59 1.47 2.31 1104.20 1.00
s(AGEP).5 6.27 0.03 1.28 3.88 4.65 5.41 6.23 7.09 7.93 8.93 2314.52 1.00
s(AGEP).6 -0.94 0.02 0.70 -2.34 -1.89 -1.42 -0.93 -0.42 -0.04 0.36 1465.38 1.00
s(AGEP).7 -4.83 0.01 0.70 -6.20 -5.75 -5.32 -4.84 -4.34 -3.93 -3.46 2570.28 1.00
s(AGEP).8 -6.48 0.06 2.14 -10.35 -9.27 -8.10 -6.48 -4.94 -3.61 -2.44 1107.59 1.00
s(AGEP).9 5.49 0.13 4.09 -0.26 0.14 1.86 5.27 8.39 10.98 14.02 1000.65 1.00
b[(Intercept) Major_Category:Business] 0.15 0.01 0.26 -0.36 -0.12 0.02 0.13 0.27 0.43 0.72 1030.59 1.00
b[sexmale Major_Category:Business] 0.28 0.01 0.15 -0.02 0.05 0.15 0.33 0.39 0.43 0.48 708.63 1.00
b[(Intercept) Major_Category:Education] -0.12 0.01 0.26 -0.63 -0.39 -0.25 -0.13 0.00 0.16 0.45 1038.92 1.00
b[sexmale Major_Category:Education] 0.09 0.01 0.15 -0.21 -0.12 -0.03 0.12 0.20 0.25 0.32 773.39 1.00
b[(Intercept) Major_Category:Health] 0.18 0.01 0.26 -0.33 -0.08 0.05 0.16 0.29 0.46 0.75 1031.20 1.00
b[sexmale Major_Category:Health] 0.25 0.01 0.16 -0.06 0.02 0.12 0.27 0.37 0.44 0.52 797.17 1.00
b[(Intercept) Major_Category:_NEW_Major_Category] -0.01 0.01 0.42 -0.89 -0.40 -0.18 -0.01 0.17 0.40 0.80 4000.00 1.00
b[sexmale Major_Category:_NEW_Major_Category] -0.02 0.01 0.41 -0.89 -0.43 -0.17 0.00 0.17 0.40 0.78 4000.00 1.00
b[(Intercept) state:NY] 0.02 0.01 0.23 -0.38 -0.12 -0.02 0.01 0.06 0.17 0.52 650.73 1.01
b[sexmale state:NY] 0.11 0.01 0.15 -0.07 -0.02 0.00 0.05 0.23 0.33 0.42 702.74 1.00
b[(Intercept) state:TX] -0.01 0.01 0.23 -0.43 -0.17 -0.06 -0.01 0.02 0.12 0.47 790.08 1.01
b[sexmale state:TX] 0.11 0.01 0.15 -0.07 -0.02 0.00 0.05 0.24 0.34 0.42 689.01 1.00
b[(Intercept) state:_NEW_state] 0.01 0.00 0.30 -0.56 -0.17 -0.05 0.00 0.05 0.18 0.58 4000.00 1.00
b[sexmale state:_NEW_state] 0.00 0.00 0.29 -0.63 -0.23 -0.06 0.00 0.06 0.23 0.60 3585.55 1.00
sigma 0.99 0.00 0.01 0.97 0.98 0.98 0.99 1.00 1.00 1.01 4000.00 1.00
smooth_sd[s(AGEP)1] 3.84 0.03 0.95 2.34 2.73 3.14 3.72 4.42 5.12 5.98 1269.33 1.00
smooth_sd[s(AGEP)2] 3.16 0.05 1.98 0.14 0.56 1.62 3.08 4.43 5.73 7.37 1311.70 1.00
Sigma[Major_Category:(Intercept),(Intercept)] 0.17 0.01 0.44 0.01 0.02 0.03 0.07 0.16 0.37 0.91 2836.82 1.00
Sigma[Major_Category:sexmale,(Intercept)] 0.04 0.00 0.18 -0.18 -0.05 0.00 0.02 0.06 0.15 0.39 2211.61 1.00
Sigma[Major_Category:sexmale,sexmale] 0.15 0.01 0.28 0.00 0.01 0.03 0.08 0.17 0.35 0.79 2045.92 1.00
Sigma[state:(Intercept),(Intercept)] 0.10 0.02 0.50 0.00 0.00 0.00 0.01 0.04 0.16 0.83 1058.05 1.00
Sigma[state:sexmale,(Intercept)] 0.00 0.00 0.12 -0.16 -0.04 0.00 0.00 0.00 0.04 0.21 1709.25 1.00
Sigma[state:sexmale,sexmale] 0.09 0.01 0.24 0.00 0.00 0.00 0.01 0.07 0.22 0.62 1925.53 1.00
mean_PPD 10.80 0.00 0.02 10.76 10.77 10.79 10.80 10.81 10.82 10.84 4000.00 1.00
log-posterior -7091.96 0.16 5.12 -7102.58 -7098.69 -7095.24 -7091.49 -7088.43 -7085.65 -7082.97 1087.36 1.00

Just as with the linear model, all Rhat values are below 1.1. This diagnostic indicates that the chains have converged.

Once again, we also inspect the trace plots of the sampling chains:

mcmc_trace(as.array(log_GAMM), np = nuts_params(log_GAMM))

Compared to the linear model, there are fewer divergent transitions (15), but of course it would be preferable to see even fewer of them.

Predictive performance: PSIS-LOO

While the number of divergent transitions has improved over the linear model, the PSIS-LOO diagnostics for the GAMM look quite bad:

loo_GAMM <- loo(log_lik(log_GAMM))
par(bg = '#fffff8')
plot(loo_GAMM)

cbind(c("good", "ok", "bad", "very bad"), loo::pareto_k_table(loo_GAMM)) %>% 
  knitr::kable(.)
Range         Category   Count   Proportion
(-Inf, 0.5]   good          53   0.0106
(0.5, 0.7]    ok          2250   0.45
(0.7, 1]      bad         1342   0.2684
(1, Inf)      very bad    1355   0.271

Around 54% of the Pareto k values are bad or very bad, indicating either that the GAMM fits the data poorly or that the model is too complex.
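When this many k values exceed 0.7, the PSIS-LOO estimates themselves become unreliable; rstanarm's `kfold()` would be the standard (if computationally expensive) fallback for model comparison. A sketch, not something we ran to completion here:

```r
# K-fold cross-validation as a fallback when PSIS-LOO is unreliable;
# with a dataset of this size, refitting the model 10 times is slow but exact
kfold_GAMM <- kfold(log_GAMM, K = 10)
```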

While there are endless options for modifying the model's priors in search of an improvement, the few we tried (flat priors for the intercept and the smooth function) did not improve the model fit.
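For reference, this is roughly what the flat-prior attempt looked like. In rstanarm, passing NULL for a prior argument yields a flat (improper uniform) prior; the formula and data names below are simplified placeholders, not the exact call behind `log_GAMM`:

```r
# Hedged sketch of the flat-prior variant (placeholder formula and names)
log_GAMM_flat <- stan_gamm4(
  log_wage ~ sex + s(AGEP),
  random = ~(1 + sex | Major_Category) + (1 + sex | state),
  data = model_data,
  prior_intercept = NULL,   # flat prior on the intercept
  prior_smooth = NULL       # flat prior on the smooth's standard deviations
)
```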

We also tried to reduce the number of knot points from the 9 knots automatically picked by rstanarm to 3 knots located at the first, second and third quartiles, based on the premise that the high number of splines might be overkill. However, this did not help with the PSIS-LOO diagnostic either.
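A sketch of that reduced-knot variant, again with placeholder names: `stan_gamm4()` forwards a `knots` list to mgcv, so the quartiles of age can be supplied directly.

```r
# Place 3 knots at the quartiles of age (placeholder variable names)
age_knots <- quantile(model_data$AGEP, probs = c(0.25, 0.5, 0.75), names = FALSE)

log_GAMM_k3 <- stan_gamm4(
  log_wage ~ sex + s(AGEP, k = 3, bs = "cr"),  # cubic regression spline, 3 knots
  random = ~(1 + sex | Major_Category) + (1 + sex | state),
  data = model_data,
  knots = list(AGEP = age_knots)
)
```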

Posterior predictions

Even though the PSIS-LOO diagnostic indicates a poor fit, we will power through for now. This means that we again draw (4000) posterior predictive samples from the fitted model.

The first thing we investigate are the central 95% intervals for the predicted wages:

GAMM_sample_grid <- expand.grid(AGEP = 20:70, 
                                sex = c("male", "female"), 
                                Major_Category = c("Education", "Health", "Business"), 
                                state = c("NY", "TX"),
                                KEEP.OUT.ATTRS = F,
                                stringsAsFactors = F)

log_post_GAMM <- posterior_predict(log_GAMM, GAMM_sample_grid, seed = 42)
exp_post_GAMM <- exp(log_post_GAMM)

post_GAMM_df <- bind_cols(GAMM_sample_grid, as.data.frame(t(exp_post_GAMM)))

post_GAMM_long <- post_GAMM_df %>% 
  gather(key = "sample_no", value = "predicted_earning", V1:V4000)

ci_post_GAMM_pred <- post_GAMM_long %>% 
  group_by(AGEP, sex, Major_Category, state) %>% 
  summarise(q_lower = quantile(predicted_earning, 0.025, names = F),
            median = quantile(predicted_earning, 0.5, names = F),
            q_upper = quantile(predicted_earning, 0.975, names = F),
            q90 = quantile(predicted_earning, 0.9, names = F)) %>% 
  mutate(model = "GAMM")

ci_post_GAMM_pred %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = median, colour = sex)) +
  geom_line(aes(y = q_lower, colour = sex), linetype = "dashed") +
  geom_line(aes(y = q_upper, colour = sex), linetype = "dashed") +
  ggtitle("Posterior prediction (GAMM): yearly wages (median & central 95% interval)") +
  xlab("age") +
  ylab("yearly wages in USD") +
  facet_grid(state ~ Major_Category)

It appears that the GAMM has picked up the different earning potentials of the three major categories – business and health are highest, with education trailing behind.

While women’s predicted wages are consistently lower than men’s, the shape of the two curves does not dramatically differ. At most, we could say that the wage curve for women is a bit flatter than the one for men.

We can also once again show the predicted wages in the context of the real data:

bind_rows(ci_post_GAMM_pred, data_intervals_median) %>% 
  mutate(plot_cat = paste0(str_sub(sex, 1, 1), " (", model, ")")) %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = median, colour = plot_cat)) +
  geom_line(aes(y = q90, colour = plot_cat), linetype = "dashed") + 
  scale_colour_manual(values = c("#e31a1c", "#fd8d3c", "#08306b", "#2171b5"),
                      name = "group") +
  ggtitle("Posterior prediction (GAMM) & real data: yearly wages (median & 90% quantile)") +
  xlab("age") +
  ylab("yearly wages in USD") +
  facet_grid(state ~ Major_Category)

Across all panels, we observe the same problem that was present with the linear model: the predicted wage variation is substantially higher than that in the real data. In addition, the GAMM does not seem to pick up that real-world median wages don’t drop off for health graduates at older ages. Those things aside, the predicted median wages at least track the real-world medians reasonably well.

Just as we did when evaluating the linear model, we can use the posterior samples to describe the gender wage gap. We will skip showing the differences in absolute wages this time and go straight to the relative wage ratios:

ci_GAMM_ratio <- post_GAMM_long %>% 
  spread(key = "sex", value = "predicted_earning") %>% 
  mutate(ratio_mf = male / female) %>% 
  group_by(AGEP, Major_Category, state) %>% 
  summarise(q_lower = quantile(ratio_mf, 0.025, names = F),
            median = quantile(ratio_mf, 0.5, names = F),
            q_upper = quantile(ratio_mf, 0.975, names = F)) 

ci_GAMM_ratio %>% 
  ggplot(., aes(x = AGEP)) +
  geom_line(aes(y = log2(median))) +
  geom_line(aes(y = log2(q_lower)), linetype = "dashed") +
  geom_line(aes(y = log2(q_upper)), linetype = "dashed") +
  ggtitle("Posterior prediction (GAMM): log2 ratio of wages (male / female; central 95% interval)") +
  xlab("age") +
  ylab("log2 ratio of yearly wages") +
  facet_grid(state ~ Major_Category) 

The linear model showed an increase in the wage gap toward higher ages but no differences between the panels; with the GAMM we have the opposite problem: the lines are essentially horizontal in all panels, while their vertical position differs between the major categories.

Once again, we can make our observations from the ratio plot more explicit by summarising the interval boundaries and the median for specific groups. This time, we group by age as well as major category. These groupings illustrate the situation fairly well: ratios within one major category are constant across all ages, but there are differences between the major categories:

ci_GAMM_ratio %>% 
  filter(AGEP %in% c(25, 35, 45, 55)) %>% 
  group_by(Major_Category, AGEP) %>% 
  summarise(mean_lower = mean(q_lower),
            mean_50 = mean(median),
            mean_upper = mean(q_upper)) %>% 
  mutate(mean_lower = paste0("female ", round(1 / mean_lower, 2), "x higher"),
         mean_50 = paste0("male ", round(mean_50, 2), "x higher"),
         mean_upper = paste0("male ", round(mean_upper, 2), "x higher")) %>% 
  knitr::kable(., col.names = c("major category", "age", "2.5%", "50%", "97.5%"), 
               align = "lrccc")
major category   age   2.5%                   50%                 97.5%
Business          25   female 9.54x higher    male 1.56x higher   male 22.64x higher
Business          35   female 9.92x higher    male 1.48x higher   male 21.8x higher
Business          45   female 10.58x higher   male 1.49x higher   male 24.46x higher
Business          55   female 10.7x higher    male 1.5x higher    male 24.61x higher
Education         25   female 12.79x higher   male 1.22x higher   male 18.54x higher
Education         35   female 12.63x higher   male 1.23x higher   male 19.78x higher
Education         45   female 12.72x higher   male 1.21x higher   male 19x higher
Education         55   female 12.17x higher   male 1.25x higher   male 18.88x higher
Health            25   female 10.9x higher    male 1.44x higher   male 24x higher
Health            35   female 10.4x higher    male 1.45x higher   male 21.76x higher
Health            45   female 10.38x higher   male 1.49x higher   male 22.5x higher
Health            55   female 10.83x higher   male 1.48x higher   male 22.96x higher

Conclusions and thoughts

As you have probably realised by now, we did not exaggerate when we said in the introduction that this work does not provide groundbreaking answers to any unsolved research questions surrounding the gender wage gap.

Unfortunately, the models we ran have not provided any substantial findings. We could demonstrate that the wage gap exists across all ages and backgrounds, but frankly, that’s an observation which could probably be made just by looking at the census data summary plots that we provided early on in the report – although, to be fair, creating those actually did require a non-zero amount of data cleaning work.

So what went wrong?

Let’s deal with the elephant in the room right away: we did not do ourselves any favours by using only rstanarm functions instead of writing out explicit Stan models. The downsides here are (at least) twofold.

First off, the flexibility in model construction enabled by sampling methods such as Hamiltonian Monte Carlo is arguably the biggest strength of modern Bayesian methods. While rstanarm allows plenty of flexibility for many standard applications like the regression models we attempted, it is inherently less flexible than “proper” Stan models.

Furthermore, using rstanarm functions introduces a level of abstraction into the process of building, fitting and checking models. Even though there have been fairly obvious problems with some of the models we tried in this project – frankly, even with the ones that made it into this report – we often found ourselves somewhat unable to tackle these problems within the syntax of rstanarm. Obviously, most of this is due to our inexperience with the Bayesian workflow in real-world settings and it would be unfair to just blame all our shortcomings on the interface we used. However, for better or worse, the workflow seems to be more immediate and sometimes easier to figure out when writing explicit Stan models.

Speaking of the workflow: it turns out to be a lot harder to follow when there is no clear analysis path laid out for you (as there was during the assignments). In hindsight, this particular dataset was probably a bit too large for beginners such as ourselves. The sheer number of records, combined with uncertainty about whether and how to reduce the data (we also considered using summary statistics instead of individual records), made it far from trivial to quickly try out new model set-ups or priors: while prototypes could be fitted within seconds during the assignments, sampling run-time easily exceeded an hour for some of the models we tried here.

Bottom line: while the project didn’t really lead to any research findings, we have learned plenty about applying the Bayesian workflow in practice. We will leave the research, at least in this field, to the experts.